The User Agent String (UAS) is an HTTP request header that describes the software acting on the user’s behalf







The original purpose of the UAS was “content negotiation” with the server

A UAS contains many useful data points
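A minimal sketch of pulling a few such data points out of a raw UAS, in plain Python (the regex patterns and field names are illustrative, not a production parser):

```python
import re

# A typical Chrome-on-Windows UAS
uas = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
       "AppleWebKit/537.36 (KHTML, like Gecko) "
       "Chrome/96.0.4664.110 Safari/537.36")

def extract_data_points(uas):
    """Pull a few illustrative fields out of a user agent string."""
    os_match = re.search(r"\(([^;)]+)", uas)     # first token inside the parentheses
    chrome = re.search(r"Chrome/([\d.]+)", uas)  # browser name + version
    return {
        "os": os_match.group(1) if os_match else None,
        "browser": "Chrome" if chrome else "other",
        "browser_version": chrome.group(1) if chrome else None,
    }

print(extract_data_points(uas))
```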




UAS have use cases beyond content negotiation




UAS-based features can encode many useful user characteristics



UAS-based features: in many cases, parsing & simple one-hot encoding won’t do
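A toy illustration (made-up strings) of why naive one-hot encoding breaks down: almost every UAS is unique, so the encoding dimension grows with the data, and any unseen string, even one that differs only in a build number, carries no signal at all:

```python
# One-hot over whole strings: every distinct UAS gets its own column.
train = [
    "Mozilla/5.0 (Windows NT 10.0) Chrome/96.0.4664.110",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15",
]
vocab = {uas: i for i, uas in enumerate(sorted(set(train)))}

def one_hot(uas):
    vec = [0] * len(vocab)
    if uas in vocab:              # an unseen UAS maps to the all-zero vector
        vec[vocab[uas]] = 1
    return vec

# A new Chrome build is "out of vocabulary" despite being nearly identical:
new_uas = "Mozilla/5.0 (Windows NT 10.0) Chrome/97.0.4692.71"
print(one_hot(new_uas))
```

With 200,000 unique UAS (as in the sample below), this scheme would need 200,000 columns and still fail on the next browser release.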



Let’s embed UAS into a low-dimensional space!

  • Can be done in a variety of ways

  • fastText (Bojanowski et al. 2016) is a particularly useful algorithm:

    • not data-hungry
    • works well on short documents
    • fast to train
    • out-of-vocabulary words are not a problem
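The last point comes from fastText’s subword model: each word is wrapped in `<` and `>` boundary markers and decomposed into character n-grams, so an unseen token still shares most of its n-grams with the training vocabulary. A stdlib sketch of the idea, using the same `minn`/`maxn` bounds as the training calls below:

```python
def char_ngrams(word, minn=2, maxn=6):
    """Character n-grams the way fastText builds subword features:
    the word is wrapped in '<' and '>' boundary markers first."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(w) - n + 1)]

# Even a brand-new version token overlaps heavily with seen n-grams,
# so it still gets a meaningful vector.
print(char_ngrams("NT", minn=2, maxn=3))
```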

Using the official fasttext Python library in R is easy, thanks to reticulate

# Install `fasttext` first (see https://fasttext.cc/docs/en/support.html)

# Load the `reticulate` package
require(reticulate)

# Make sure `fasttext` is available to R:
py_module_available("fasttext") 
## [1] TRUE

# Load `fasttext`:
ft <- import("fasttext")

# Then call the required methods using the `$` notation, e.g.: `ft$train_supervised`

fastText transformers can be trained in both unsupervised and supervised modes


Example dataset

A sample of 200,000 unique UAS from the whatismybrowser.com database



Unsupervised training of a fastText transformer


m_unsup <- ft$train_unsupervised(input = "./data/train_data_unsup.txt",
                                 model = "skipgram",
                                 lr = 0.05, 
                                 dim = 32L, # vector dimension
                                 ws = 3L, 
                                 minCount = 1L,
                                 minn = 2L, 
                                 maxn = 6L, 
                                 neg = 3L, 
                                 wordNgrams = 2L, 
                                 loss = "ns",
                                 epoch = 100L, 
                                 thread = 10L)

Getting the UAS vector representations

# Load `dplyr` for the pipe and bind_rows()
library(dplyr)

test_data <- readLines("./data/test_data_unsup.txt")

emb_unsup <- test_data %>% 
  lapply(., function(x) {
    m_unsup$get_sentence_vector(text = x) %>% # returns average vector for a UAS
      t(.) %>% as.data.frame(.)
  }) %>% 
  bind_rows(.) %>% 
  setNames(., paste0("f", 1:32))

emb_unsup[1:3, 1:10]
##      f1       f2    f3    f4      f5     f6      f7    f8     f9    f10
## 1 0.197 -0.03726 0.147 0.153  0.0423 0.0488  0.0196 0.132 0.1946  0.186
## 2 0.182  0.00307 0.147 0.101  0.0326 0.0847 -0.0174 0.108 0.1957  0.171
## 3 0.101 -0.28220 0.189 0.202 -0.1623 0.2622  0.1386 0.106 0.0733 -0.035
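As the comment in the pipeline above notes, `get_sentence_vector` averages the word vectors of the UAS. A simplified stdlib sketch of that averaging idea (toy 3-dimensional vectors instead of the real 32-dimensional ones, and ignoring the per-word normalization the real implementation applies):

```python
# Toy word vectors, made up for illustration.
word_vecs = {
    "Mozilla/5.0": [0.2, 0.1, 0.0],
    "Chrome/96.0": [0.4, 0.3, 0.2],
}

def sentence_vector(text):
    """Average the word vectors -- the idea behind get_sentence_vector."""
    vecs = [word_vecs[w] for w in text.split() if w in word_vecs]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

print(sentence_vector("Mozilla/5.0 Chrome/96.0"))
```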

The resultant embeddings are quite useful

Adding labels to data
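fastText’s supervised mode expects each training line to start with the label under its default `__label__` prefix. A short Python sketch of producing such lines (the `bot`/`human` labels are hypothetical, for illustration only):

```python
# Each supervised training line: "__label__<label> <text>"
samples = [
    ("bot",   "Mozilla/5.0 (compatible; Googlebot/2.1)"),
    ("human", "Mozilla/5.0 (Windows NT 10.0) Chrome/96.0.4664.110"),
]
lines = [f"__label__{label} {uas}" for label, uas in samples]
print(lines[0])
```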



Supervised training of a fastText transformer


m_sup <- ft$train_supervised(input = "./data/train_data_sup.txt",
                             lr = 0.05,
                             dim = 32L, # vector dimension
                             ws = 3L,
                             minCount = 1L,
                             minCountLabel = 10L, # min label occurrence
                             minn = 2L,
                             maxn = 6L,
                             neg = 3L,
                             wordNgrams = 2L,
                             loss = "softmax", # loss function
                             epoch = 100L,
                             thread = 10L)

The resultant embeddings are even better!

Take-home messages